The whole process of predicting the Hotel Occupancy Rate (TPK BPS) was carried out in Jupyter Notebook version 6.1.6 on Python 3.8.2 x64 for Windows.
These are the libraries I used in this competition:
- Pandas, for the data processing using table-like form
- Numpy, for the data processing using array-like form
- Scikit-learn, for the machine learning tasks
- Plotly, for data graphing
- Matplotlib, for data plotting
The committee only gave us the Daily Hotel Occupancy Rate retrieved online (tpk_harian) as the X variable and the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) as the Y variable. The lack of data encouraged me to gather other sources as follows:
- covid_harian_aktif = daily covid active cases, retrieved from KawalCovid19
- covid_harian = daily new covid cases, retrieved from KawalCovid19
- covid_total = total cases of covid at the end of the month (last day), retrieved from KawalCovid19
- penerbangan = the number of flight passengers to Bali, retrieved from bali.bps.go.id
- wisatawan = the number of domestic tourists coming to Bali, retrieved from bps.go.id and from Contact Person from Disparda Bali (Dinas Pariwisata Bali)
- wisatawan_mancanegara = the number of foreign tourists coming to Bali, retrieved from bali.bps.go.id
- tpk_bps_arima = the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) from the previous months (lagged by one period)
- hari = the number of days in a month
- mobility = google mobility data index for INDONESIA (not Bali in particular), retrieved from OurWorldInData.Org
As for the models, here are the ones I tried running on the data:
- Linear Regression
- Ridge Regression
- Random Forest Regressor
- Support Vector Regression (SVR)
- K-Nearest Neighbor Regressor
- MLPRegressor (Neural Network Regression)
- Lasso Regression
- Decision Tree Regressor
# Modules installation, if prompted
# ! pip install pandas
# ! pip install scikit-learn
# ! pip install plotly
# ! pip install matplotlib
# ! pip install voila
# Import modules and give them aliases
# Modules for data reading
import pandas as pd
import sklearn as sk
import numpy as np
# Modules for data graphing
import plotly.graph_objects as go
import matplotlib.pyplot as plt
# Modules for model building
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Modules for model testing
import math
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report # classification metrics, not used for this regression task
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
# Reading CSV Files, the data used in this project
# Data for Dependent Variable
tpk_bps = pd.read_csv('Datasets/train-TPK_Hotel_berbintang_2020.csv') # from the committee
# Data for Independent Variables
tpk_harian = pd.read_csv('Datasets/train-online_booking_2020.csv') # from the committee
penerbangan = pd.read_csv('Datasets/train-penerbangan_2020.csv') # number of flight passengers, from BPS website
wisatawan = pd.read_csv('Datasets/train-wisatawan_domestik_2020.csv') # from Disparda Bali
# Remove possible NaN values from the daily historical data
# (dropna() returns a copy, so the result must be assigned back)
tpk_harian = tpk_harian.dropna()
tpk_harian
| province | type | tanggal | klasifikasi | all_available_room | room_total | |
|---|---|---|---|---|---|---|
| 0 | Bali | Hotel | 1/1/2020 | Rated | 2155 | 37756 |
| 1 | Bali | Hotel | 1/1/2020 | Not Rated | 163 | 392 |
| 2 | Bali | Hotel | 1/2/2020 | Rated | 3878 | 38814 |
| 3 | Bali | Hotel | 1/2/2020 | Not Rated | 211 | 415 |
| 4 | Bali | Hotel | 1/3/2020 | Rated | 5412 | 39084 |
| ... | ... | ... | ... | ... | ... | ... |
| 691 | Bali | Hotel | 12/29/2020 | Not Rated | 475 | 568 |
| 692 | Bali | Hotel | 12/30/2020 | Rated | 24240 | 45600 |
| 693 | Bali | Hotel | 12/30/2020 | Not Rated | 427 | 551 |
| 694 | Bali | Hotel | 12/31/2020 | Rated | 26194 | 46764 |
| 695 | Bali | Hotel | 12/31/2020 | Not Rated | 441 | 523 |
696 rows × 6 columns
# This function is retrieved from https://gist.github.com/Xylambda/b8f38dce74dd3d54ff906eebfe560ac0
# which uses Fast Fourier Transform to denoise the possible noisy data on covid_harian, covid_harian_aktif and tpk_harian
def fft_denoiser(x, n_components, to_real=True):
    """Fast Fourier transform denoiser.

    Denoises data using the fast Fourier transform.

    Parameters
    ----------
    x : numpy.array
        The data to denoise.
    n_components : int
        The power threshold above which coefficients are kept.
    to_real : bool, optional, default: True
        Whether to remove the complex part (True) or not (False).

    Returns
    -------
    clean_data : numpy.array
        The denoised data.

    References
    ----------
    .. [1] Steve Brunton - Denoising Data with FFT [Python]
       https://www.youtube.com/watch?v=s2K1JfNR7Sc&ab_channel=SteveBrunton
    """
    n = len(x)
    # Compute the FFT
    fft = np.fft.fft(x, n)
    # Compute the power spectral density
    # (squared magnitude of each FFT coefficient)
    PSD = fft * np.conj(fft) / n
    # Keep only coefficients whose power exceeds the threshold
    _mask = PSD > n_components
    fft = _mask * fft
    # Inverse Fourier transform
    clean_data = np.fft.ifft(fft)
    if to_real:
        clean_data = clean_data.real
    return clean_data
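As a quick sanity check of the thresholding behaviour, the denoiser can be run on a synthetic noisy sine; the condensed version below mirrors the function above (this demo signal is mine, not part of the competition data):

```python
import numpy as np

# Same logic as fft_denoiser above, condensed for the demo
def fft_denoiser(x, n_components, to_real=True):
    n = len(x)
    fft = np.fft.fft(x, n)
    psd = fft * np.conj(fft) / n          # power of each coefficient
    fft = (psd > n_components) * fft      # zero out low-power (noise) coefficients
    clean = np.fft.ifft(fft)
    return clean.real if to_real else clean

# Synthetic check: a 5 Hz sine buried in Gaussian noise
np.random.seed(0)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t)
noisy = signal + np.random.normal(0, 0.5, 500)
denoised = fft_denoiser(noisy, n_components=20)

# The denoised series is much closer to the true signal than the noisy one
print(np.mean((denoised - signal) ** 2) < np.mean((noisy - signal) ** 2))  # True
```

The threshold acts on power, not frequency: the sine's coefficients carry power well above 20, while each noise coefficient's power stays near the noise variance and gets zeroed.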
The original data does not include tpk_online; it is calculated below from the available rooms and total rooms in the data.
# Show information on data types
tpk_harian.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 696 entries, 0 to 695
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   province            696 non-null    object
 1   type                696 non-null    object
 2   tanggal             696 non-null    object
 3   klasifikasi         696 non-null    object
 4   all_available_room  696 non-null    int64
 5   room_total          696 non-null    int64
dtypes: int64(2), object(4)
memory usage: 32.8+ KB
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
tpk_harian['tanggal'] = pd.to_datetime(tpk_harian['tanggal'], format="%m/%d/%Y")
# Show first five rows of the data frame
tpk_harian.head()
| province | type | tanggal | klasifikasi | all_available_room | room_total | |
|---|---|---|---|---|---|---|
| 0 | Bali | Hotel | 2020-01-01 | Rated | 2155 | 37756 |
| 1 | Bali | Hotel | 2020-01-01 | Not Rated | 163 | 392 |
| 2 | Bali | Hotel | 2020-01-02 | Rated | 3878 | 38814 |
| 3 | Bali | Hotel | 2020-01-02 | Not Rated | 211 | 415 |
| 4 | Bali | Hotel | 2020-01-03 | Rated | 5412 | 39084 |
# Show information on data types
tpk_harian.dtypes
province                      object
type                          object
tanggal               datetime64[ns]
klasifikasi                   object
all_available_room             int64
room_total                     int64
dtype: object
# Calculate used/booked rooms on tpk_harian:
# the total rooms minus the rooms still available (not booked) on that day
tpk_harian['used_room'] = tpk_harian.room_total-tpk_harian.all_available_room
# Calculate the value of tpk_online (the percentage
# of booked rooms out of total rooms)
tpk_harian['tpk_online'] = tpk_harian.used_room/tpk_harian.room_total*100
# Show tpk_harian with the new columns used_room and tpk_online added
tpk_harian
| province | type | tanggal | klasifikasi | all_available_room | room_total | used_room | tpk_online | |
|---|---|---|---|---|---|---|---|---|
| 0 | Bali | Hotel | 2020-01-01 | Rated | 2155 | 37756 | 35601 | 94.292298 |
| 1 | Bali | Hotel | 2020-01-01 | Not Rated | 163 | 392 | 229 | 58.418367 |
| 2 | Bali | Hotel | 2020-01-02 | Rated | 3878 | 38814 | 34936 | 90.008760 |
| 3 | Bali | Hotel | 2020-01-02 | Not Rated | 211 | 415 | 204 | 49.156627 |
| 4 | Bali | Hotel | 2020-01-03 | Rated | 5412 | 39084 | 33672 | 86.152901 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 691 | Bali | Hotel | 2020-12-29 | Not Rated | 475 | 568 | 93 | 16.373239 |
| 692 | Bali | Hotel | 2020-12-30 | Rated | 24240 | 45600 | 21360 | 46.842105 |
| 693 | Bali | Hotel | 2020-12-30 | Not Rated | 427 | 551 | 124 | 22.504537 |
| 694 | Bali | Hotel | 2020-12-31 | Rated | 26194 | 46764 | 20570 | 43.986827 |
| 695 | Bali | Hotel | 2020-12-31 | Not Rated | 441 | 523 | 82 | 15.678776 |
696 rows × 8 columns
# Calculate the upper (90th percentile) threshold of the data
max_threshold = tpk_harian['tpk_online'].quantile(0.90)
max_threshold
50.37341670604346
# Calculate the lower (10th percentile) threshold of the data
min_threshold = tpk_harian['tpk_online'].quantile(0.10)
min_threshold
13.100003163420439
# Remove outliers beyond the lower and upper thresholds
tpk_harian = tpk_harian[(tpk_harian['tpk_online'] < max_threshold) & (tpk_harian['tpk_online'] > min_threshold)]
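The two-sided quantile filter above generalizes to a small helper; a minimal sketch (the helper name `filter_quantiles` is mine, not part of the notebook):

```python
import pandas as pd

def filter_quantiles(df, column, lower=0.10, upper=0.90):
    """Keep only rows whose `column` value lies strictly between the given quantiles."""
    lo = df[column].quantile(lower)
    hi = df[column].quantile(upper)
    return df[(df[column] > lo) & (df[column] < hi)]

# Toy example: values 1..10; the 10th/90th percentile filter drops the extremes
toy = pd.DataFrame({'tpk_online': range(1, 11)})
filtered = filter_quantiles(toy, 'tpk_online')
print(len(filtered))  # 8 (the quantiles are 1.9 and 9.1, so 2..9 survive)
```

Note that, like the notebook's version, the inequalities are strict, so values exactly on a threshold are dropped.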
# Denoise the possibly noisy data
# (work on a copy to avoid pandas' SettingWithCopyWarning when assigning to a filtered frame)
tpk_harian = tpk_harian.copy()
tpk_harian['tpk_online_denoised_fft'] = fft_denoiser(tpk_harian['tpk_online'], 10, to_real=True)
tpk_harian['tpk_online_denoised_fft']
3 48.018343
5 42.963230
7 37.568041
9 35.800638
11 34.379302
...
691 16.033055
692 47.276324
693 23.166606
694 44.618422
695 15.458028
Name: tpk_online_denoised_fft, Length: 556, dtype: float64
# Prepare variable 'tanggal' as index for grouping the data based on month (the following code)
tpk_harian = tpk_harian.set_index('tanggal')
# Make a new data frame, named tpk_harian_agg,
# which groups the data into months using the median, with 'tanggal' as the index
tpk_harian_agg = tpk_harian.groupby(pd.Grouper(freq='M')).median()
The median is used for now; other aggregation methods can be tried here if they represent the data better.
# Reset the index on the new data frame and display it
tpk_harian_agg = tpk_harian_agg.reset_index()
tpk_harian_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 |
| 1 | 2020-02-29 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 |
| 2 | 2020-03-31 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 |
| 3 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 |
| 4 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 |
| 5 | 2020-06-30 | 33407.0 | 50624.0 | 16181.0 | 32.457119 | 32.635825 |
| 6 | 2020-07-31 | 34482.0 | 51258.0 | 15414.0 | 30.432621 | 30.308360 |
| 7 | 2020-08-31 | 2274.0 | 3793.0 | 1519.0 | 32.988680 | 32.097701 |
| 8 | 2020-09-30 | 29066.0 | 49852.0 | 19863.0 | 37.913000 | 37.222877 |
| 9 | 2020-10-31 | 29403.5 | 49999.5 | 18812.0 | 35.942146 | 35.518501 |
| 10 | 2020-11-30 | 27918.0 | 48904.0 | 19868.0 | 39.083442 | 38.770749 |
| 11 | 2020-12-31 | 522.0 | 650.0 | 146.0 | 23.817292 | 24.476106 |
# Show shape (dimension) of array
tpk_harian_agg.shape
(12, 6)
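The monthly aggregation above relies on `pd.Grouper(freq='M')`; a minimal toy sketch of the pattern (in newer pandas the alias is `'ME'`, but `'M'` matches the notebook's pandas version):

```python
import pandas as pd

# Toy daily series spanning two months: Jan 30, Jan 31, Feb 1, Feb 2
dates = pd.date_range('2020-01-30', periods=4, freq='D')
toy = pd.DataFrame({'tanggal': dates, 'value': [10, 20, 30, 40]})
toy = toy.set_index('tanggal')

# Group by calendar month and take the median, as done for tpk_harian_agg
monthly = toy.groupby(pd.Grouper(freq='M')).median()
print(monthly['value'].tolist())  # [15.0, 35.0]
```

Each group's index label is the month-end date (2020-01-31, 2020-02-29), which is why the aggregated tanggal column above shows month-end dates.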
# Add a new column named 'Id' to the data frame tpk_harian_agg,
# with values ranging from 1 to 12,
# then display the data
tpk_harian_agg['Id'] = np.arange(start=1, stop=13, step=1)
tpk_harian_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 | 1 |
| 1 | 2020-02-29 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 | 2 |
| 2 | 2020-03-31 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 | 3 |
| 3 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 | 4 |
| 4 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 | 5 |
| 5 | 2020-06-30 | 33407.0 | 50624.0 | 16181.0 | 32.457119 | 32.635825 | 6 |
| 6 | 2020-07-31 | 34482.0 | 51258.0 | 15414.0 | 30.432621 | 30.308360 | 7 |
| 7 | 2020-08-31 | 2274.0 | 3793.0 | 1519.0 | 32.988680 | 32.097701 | 8 |
| 8 | 2020-09-30 | 29066.0 | 49852.0 | 19863.0 | 37.913000 | 37.222877 | 9 |
| 9 | 2020-10-31 | 29403.5 | 49999.5 | 18812.0 | 35.942146 | 35.518501 | 10 |
| 10 | 2020-11-30 | 27918.0 | 48904.0 | 19868.0 | 39.083442 | 38.770749 | 11 |
| 11 | 2020-12-31 | 522.0 | 650.0 | 146.0 | 23.817292 | 24.476106 | 12 |
# Add a new column named 'Id' to the data frame penerbangan,
# with values ranging from 1 to 12,
# then display the data
penerbangan['Id'] = np.arange(start=1, stop=13, step=1)
penerbangan
| tanggal_ter | penerbangan | Id | |
|---|---|---|---|
| 0 | 1-Jan-20 | 1094169 | 1 |
| 1 | 1-Feb-20 | 772595 | 2 |
| 2 | 1-Mar-20 | 527776 | 3 |
| 3 | 1-Apr-20 | 50874 | 4 |
| 4 | 1-May-20 | 4047 | 5 |
| 5 | 1-Jun-20 | 12273 | 6 |
| 6 | 1-Jul-20 | 43492 | 7 |
| 7 | 1-Aug-20 | 84721 | 8 |
| 8 | 1-Sep-20 | 81321 | 9 |
| 9 | 1-Oct-20 | 99562 | 10 |
| 10 | 1-Nov-20 | 169895 | 11 |
| 11 | 1-Dec-20 | 189485 | 12 |
# Add a new column named 'Id' to the data frame wisatawan,
# with values ranging from 1 to 12,
# then display the data
wisatawan['Id'] = np.arange(start=1, stop=13, step=1)
wisatawan
| tanggal_wis | wisatawan | Id | |
|---|---|---|---|
| 0 | 1-Jan-20 | 879702 | 1 |
| 1 | 1-Feb-20 | 721105 | 2 |
| 2 | 1-Mar-20 | 567452 | 3 |
| 3 | 1-Apr-20 | 175120 | 4 |
| 4 | 1-May-20 | 101948 | 5 |
| 5 | 1-Jun-20 | 137395 | 6 |
| 6 | 1-Jul-20 | 229112 | 7 |
| 7 | 1-Aug-20 | 355732 | 8 |
| 8 | 1-Sep-20 | 283349 | 9 |
| 9 | 1-Oct-20 | 337304 | 10 |
| 10 | 1-Nov-20 | 425097 | 11 |
| 11 | 1-Dec-20 | 382841 | 12 |
# Join all the data frames (whether or not used in the final models)
# on the column 'Id' into a new data frame named tpk_join,
# then display the first five rows of the data
tpk_join = pd.merge(tpk_bps,tpk_harian_agg,on='Id',how='left')
tpk_join = pd.merge(tpk_join,penerbangan,on='Id',how='left')
tpk_join = pd.merge(tpk_join,wisatawan,on='Id',how='left')
tpk_join.head()
| Id | Provinsi | Tahun | Bulan | Aggregate_var | TPK | tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | tanggal_ter | penerbangan | tanggal_wis | wisatawan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bali | 2020 | Januari | NaN | 59.29 | 2020-01-31 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 | 1-Jan-20 | 1094169 | 1-Jan-20 | 879702 |
| 1 | 2 | Bali | 2020 | Februari | NaN | 45.98 | 2020-02-29 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 | 1-Feb-20 | 772595 | 1-Feb-20 | 721105 |
| 2 | 3 | Bali | 2020 | Maret | NaN | 25.41 | 2020-03-31 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 | 1-Mar-20 | 527776 | 1-Mar-20 | 567452 |
| 3 | 4 | Bali | 2020 | April | NaN | 3.22 | 2020-04-30 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 | 1-Apr-20 | 50874 | 1-Apr-20 | 175120 |
| 4 | 5 | Bali | 2020 | Mei | NaN | 2.07 | 2020-05-31 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 | 1-May-20 | 4047 | 1-May-20 | 101948 |
# Delete (drop) unnecessary variables
tpk_join = tpk_join.drop(['tanggal', 'Aggregate_var', 'Provinsi'], axis=1)
# Display the first five rows of tpk_join
tpk_join.head()
| Id | Tahun | Bulan | TPK | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | tanggal_ter | penerbangan | tanggal_wis | wisatawan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2020 | Januari | 59.29 | 459.0 | 766.0 | 279.0 | 37.106918 | 37.433877 | 1-Jan-20 | 1094169 | 1-Jan-20 | 879702 |
| 1 | 2 | 2020 | Februari | 45.98 | 433.5 | 713.0 | 284.0 | 39.448381 | 39.570870 | 1-Feb-20 | 772595 | 1-Feb-20 | 721105 |
| 2 | 3 | 2020 | Maret | 25.41 | 629.5 | 758.0 | 186.0 | 24.497126 | 25.584338 | 1-Mar-20 | 527776 | 1-Mar-20 | 567452 |
| 3 | 4 | 2020 | April | 3.22 | 21190.0 | 33285.0 | 9520.0 | 22.779846 | 22.753135 | 1-Apr-20 | 50874 | 1-Apr-20 | 175120 |
| 4 | 5 | 2020 | Mei | 2.07 | 16659.0 | 32539.0 | 15707.0 | 30.792608 | 31.062169 | 1-May-20 | 4047 | 1-May-20 | 101948 |
# Save the resulted data frame tpk_join into csv file in my local
tpk_join.to_csv('Results/SVR/tpk_join_v999991.csv')
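Left joins like the ones above silently fill NaN wherever an `Id` has no match on the right side; the standard `indicator=True` option of `pd.merge` can flag such rows. A small sketch with made-up right-hand data:

```python
import pandas as pd

# Left frame has three months; the right frame is missing Id 3
left = pd.DataFrame({'Id': [1, 2, 3], 'TPK': [59.29, 45.98, 25.41]})
right = pd.DataFrame({'Id': [1, 2], 'penerbangan': [1094169, 772595]})

joined = pd.merge(left, right, on='Id', how='left', indicator=True)
# Rows with no match on the right side are marked 'left_only'
unmatched = joined[joined['_merge'] == 'left_only']
print(unmatched['Id'].tolist())  # [3]
```

Checking `_merge` after each join is a cheap way to confirm all twelve months matched before modeling.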
This step aims to reveal the rough pattern between tpk_bps (the Y variable) and each independent variable. The plotting libraries imported at the beginning of this notebook are used for this purpose.
# Display scatter plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['tpk_online_denoised_fft'],
mode='lines+markers',
name='TPK Online'))
fig.show()
# Display scatter plot
# the variable 'penerbangan' is divided by 10000 to bring it onto a scale comparable with TPK
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['penerbangan']/10000,
mode='lines+markers',
name='Flight Passengers'))
fig.show()
# Display scatter plot
# the variable 'wisatawan' is divided by 10000 to bring it onto a scale comparable with TPK
fig = go.Figure()
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['TPK'],
mode='lines+markers',
name='TPK BPS'))
fig.add_trace(go.Scatter(x=tpk_join['Id'], y=tpk_join['wisatawan']/10000,
mode='lines+markers',
name='Domestic Tourists'))
fig.show()
The following tests the correlation between the Y variable and each X variable.
# Choose X variables and Y variable from data frame tpk_join
tpk_bps = tpk_join['TPK']
tpk_online = tpk_join['tpk_online_denoised_fft']
penerbangan = tpk_join['penerbangan']
wisatawan = tpk_join['wisatawan']
# Correlation check
np.corrcoef(tpk_online, tpk_bps)
array([[1. , 0.33625855],
[0.33625855, 1. ]])
# Correlation check
np.corrcoef(penerbangan, tpk_bps)
array([[1. , 0.98720155],
[0.98720155, 1. ]])
# Correlation check
np.corrcoef(wisatawan, tpk_bps)
array([[1. , 0.94955001],
[0.94955001, 1. ]])
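The three pairwise checks above can also be done in one pass with `DataFrame.corr()`; a sketch on a toy frame whose numbers are the first four months from the tables above:

```python
import pandas as pd

# Toy stand-in for tpk_join (January-April values from the joined table)
toy = pd.DataFrame({
    'TPK': [59.29, 45.98, 25.41, 3.22],
    'penerbangan': [1094169, 772595, 527776, 50874],
    'wisatawan': [879702, 721105, 567452, 175120],
})

# Pearson correlation of every X column against TPK in one call
corr = toy.corr()['TPK'].drop('TPK')
print(corr.round(3))
```

Even on just four months, the flight-passenger correlation comes out very strong, consistent with the full-year value of 0.987 reported above.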
# Display a scatter plot for variables with strong correlations (penerbangan shown here)
plt.scatter(penerbangan, tpk_bps)
<matplotlib.collections.PathCollection at 0x179bb6def10>
# Convert the data frame columns into NumPy arrays (column vectors)
datatpk = tpk_online.to_numpy().reshape(-1,1)
dataterbang = penerbangan.to_numpy().reshape(-1,1)
datawisata = wisatawan.to_numpy().reshape(-1,1)
# Combine the chosen variables as DataX
dataX = np.hstack([datatpk,datawisata,dataterbang])
dataX
array([[3.74338774e+01, 8.79702000e+05, 1.09416900e+06],
[3.95708699e+01, 7.21105000e+05, 7.72595000e+05],
[2.55843384e+01, 5.67452000e+05, 5.27776000e+05],
[2.27531346e+01, 1.75120000e+05, 5.08740000e+04],
[3.10621692e+01, 1.01948000e+05, 4.04700000e+03],
[3.26358252e+01, 1.37395000e+05, 1.22730000e+04],
[3.03083601e+01, 2.29112000e+05, 4.34920000e+04],
[3.20977009e+01, 3.55732000e+05, 8.47210000e+04],
[3.72228774e+01, 2.83349000e+05, 8.13210000e+04],
[3.55185007e+01, 3.37304000e+05, 9.95620000e+04],
[3.87707490e+01, 4.25097000e+05, 1.69895000e+05],
[2.44761062e+01, 3.82841000e+05, 1.89485000e+05]])
# Read CSV file
tpk_harian_test = pd.read_csv('Datasets/test-online_booking_2021.csv')
# Remove NaN values (dropna() returns a copy, so assign it back)
tpk_harian_test = tpk_harian_test.dropna()
# Change the datatype of variable 'tanggal' into datetime (YEAR-MONTH-DATE)
tpk_harian_test['tanggal'] = pd.to_datetime(tpk_harian_test['tanggal'], format="%m/%d/%Y")
# Calculate used_room based on available rooms and total rooms,
# and use it to calculate tpk_online (used-room percentage)
tpk_harian_test['used_room'] = tpk_harian_test['room_total']-tpk_harian_test['all_available_room']
tpk_harian_test['tpk_online'] = tpk_harian_test['used_room'] / tpk_harian_test['room_total']*100
# Denoise the possibly noisy data tpk_online
tpk_harian_test['tpk_online_denoised_fft'] = fft_denoiser(tpk_harian_test['tpk_online'], 10, to_real=True)
tpk_harian_test['tpk_online_denoised_fft']
0 40.222881
1 11.626967
2 39.564663
3 10.403010
4 40.100590
...
351 13.441035
352 40.089164
353 14.315314
354 37.725997
355 13.463850
Name: tpk_online_denoised_fft, Length: 356, dtype: float64
# Prepare variable 'tanggal' as index for grouping the data based on month (the following code)
tpk_harian_test = tpk_harian_test.set_index('tanggal')
# Make a new data frame, named tpk_harian_test_agg,
# which groups the data into months using the mean, keyed on the month of 'tanggal'
# (note: the training data above was aggregated with the median)
tpk_harian_test_agg = tpk_harian_test.groupby(by=tpk_harian_test.index.month).mean()
# Reset the index on the new data frame and display it
tpk_harian_test_agg = tpk_harian_test_agg.reset_index()
tpk_harian_test_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | |
|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 |
# Add a new column named 'Id' to the data frame tpk_harian_test_agg,
# with values ranging from 1 to 6,
# then display the data
tpk_harian_test_agg['Id'] = np.arange(start=1, stop=7, step=1)
tpk_harian_test_agg
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 | 1 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 | 2 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 | 3 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 | 4 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 | 5 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 | 6 |
# Read the csv file
penerbangan_test = pd.read_csv('Datasets/test-penerbangan_2021.csv')
# Add a new column named 'Id' to the data frame penerbangan_test,
# with values ranging from 1 to 6,
# then display the data
penerbangan_test['Id'] = np.arange(start=1, stop=7, step=1)
penerbangan_test
| tanggal_ter | penerbangan | Id | |
|---|---|---|---|
| 0 | 1-Jan-21 | 119160 | 1 |
| 1 | 1-Feb-21 | 71122 | 2 |
| 2 | 1-Mar-21 | 117088 | 3 |
| 3 | 1-Apr-21 | 142329 | 4 |
| 4 | 1-May-21 | 121076 | 5 |
| 5 | 1-Jun-21 | 226287 | 6 |
# Read the csv file
wisatawan_test = pd.read_csv('Datasets/test-wisatawan_domestik_2021.csv')
# Add a new column named 'Id' to the data frame wisatawan_test,
# with values ranging from 1 to 6,
# then display the data
wisatawan_test['Id'] = np.arange(start=1, stop=7, step=1)
wisatawan_test
| tanggal_wis | wisatawan | Id | |
|---|---|---|---|
| 0 | 1-Jan-21 | 282248 | 1 |
| 1 | 1-Feb-21 | 240608 | 2 |
| 2 | 1-Mar-21 | 305579 | 3 |
| 3 | 1-Apr-21 | 330593 | 4 |
| 4 | 1-May-21 | 363959 | 5 |
| 5 | 1-Jun-21 | 498852 | 6 |
# Join all the data frames (whether or not used in the final models)
# on the column 'Id' into a new data frame named tpk_join_test,
# then display the data
tpk_join_test = pd.merge(tpk_harian_test_agg,penerbangan_test,on='Id',how='left')
tpk_join_test = pd.merge(tpk_join_test,wisatawan_test,on='Id',how='left')
tpk_join_test
| tanggal | all_available_room | room_total | used_room | tpk_online | tpk_online_denoised_fft | Id | tanggal_ter | penerbangan | tanggal_wis | wisatawan | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17361.016129 | 27941.919355 | 10580.903226 | 23.613467 | 23.584929 | 1 | 1-Jan-21 | 119160 | 1-Jan-21 | 282248 |
| 1 | 2 | 15538.517857 | 26191.535714 | 10653.017857 | 26.480454 | 26.423566 | 2 | 1-Feb-21 | 71122 | 1-Feb-21 | 240608 |
| 2 | 3 | 16422.633333 | 26036.833333 | 9614.200000 | 23.083330 | 23.261825 | 3 | 1-Mar-21 | 117088 | 1-Mar-21 | 305579 |
| 3 | 4 | 17021.983333 | 27078.533333 | 10056.550000 | 24.765484 | 24.696790 | 4 | 1-Apr-21 | 142329 | 1-Apr-21 | 330593 |
| 4 | 5 | 17185.810345 | 27388.155172 | 10202.344828 | 25.102572 | 25.034087 | 5 | 1-May-21 | 121076 | 1-May-21 | 363959 |
| 5 | 6 | 16729.866667 | 27310.683333 | 10580.816667 | 26.392109 | 26.431097 | 6 | 1-Jun-21 | 226287 | 1-Jun-21 | 498852 |
tpk_join_test.to_csv('Results/SVR/tpk_join_test_v999991.csv')
# Convert the data frame columns into NumPy arrays (column vectors)
# Note: the training features used the denoised column tpk_online_denoised_fft;
# here the raw tpk_online is taken instead
datatpk_test = tpk_join_test['tpk_online'].to_numpy().reshape(-1,1)
dataterbang_test = tpk_join_test['penerbangan'].to_numpy().reshape(-1,1)
datawisata_test = tpk_join_test['wisatawan'].to_numpy().reshape(-1,1)
# Combine the chosen variables as DataX_test
dataX_test = np.hstack([datatpk_test,datawisata_test,dataterbang_test])
dataX_test
array([[2.36134668e+01, 2.82248000e+05, 1.19160000e+05],
[2.64804538e+01, 2.40608000e+05, 7.11220000e+04],
[2.30833299e+01, 3.05579000e+05, 1.17088000e+05],
[2.47654844e+01, 3.30593000e+05, 1.42329000e+05],
[2.51025722e+01, 3.63959000e+05, 1.21076000e+05],
[2.63921091e+01, 4.98852000e+05, 2.26287000e+05]])
# Read csv file
data_true = pd.read_csv('Datasets/sample-submission_webbali.csv')
# Get the true Y
tpk_true = data_true['TPK']
I tried the eight models listed earlier (Linear Regression, Ridge Regression, Random Forest Regressor, SVR, K-Nearest Neighbor Regressor, MLPRegressor, Lasso Regression, and Decision Tree Regressor) on the data.
However, SVR gave the best RMSE in my project, so this file includes only the SVR code. The code for the other models can be found in the file Nofriani_model_trials_v999991.ipynb.
# Import model library
from sklearn.svm import SVR
# Create model instance
svr = SVR()
# Set model hyperparameters
parameters = {
    'kernel' : ['linear'],
    # 'degree' : [2], # only significant for the poly kernel
    'gamma' : ['auto'],
    'tol' : [1.5],
    'C' : [1],
    'epsilon' : [3],
    'shrinking' : [False],
    'cache_size' : [200],
    'max_iter' : [100]
}
# Make model pipeline
model = make_pipeline(StandardScaler(),GridSearchCV(svr,parameters,scoring='neg_mean_gamma_deviance', cv=12, verbose=1))
# Train model
model.fit(X=dataX, y=tpk_bps)
# Predict y
tpk_prediksi = model.predict(dataX)
# Display prediction in a graph
plt.figure(figsize=(12, 3))
plt.plot(tpk_online, label='TPK Online')
plt.plot(tpk_bps, label='TPK BPS')
plt.plot(tpk_prediksi, label='Prediksi dari Model')
plt.legend()
plt.show()
Fitting 12 folds for each of 1 candidates, totalling 12 fits
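Because the model is a `make_pipeline(StandardScaler(), GridSearchCV(...))`, the tuned hyperparameters live on the GridSearchCV step, reachable by indexing the fitted pipeline. A self-contained sketch on synthetic data (the shapes and grid values here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data: y is a noisy linear function of three features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=20)

# Small illustrative grid; only C varies
grid = GridSearchCV(SVR(), {'kernel': ['linear'], 'C': [0.1, 1.0]}, cv=4)
model = make_pipeline(StandardScaler(), grid)
model.fit(X, y)

# The fitted GridSearchCV is the last pipeline step; best_params_ holds the winner
print(model[-1].best_params_)
```

The same `model[-1].best_params_` lookup works on the notebook's pipeline after fitting, which is useful for checking which candidate the single-point grid actually selected.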
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,13)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/SVR/hasilv999991-svr-training.csv', index=None)
# Visualize data in matplotlib
temp = pd.DataFrame()
temp['tpk_online'] = tpk_online
temp['tpk_bps'] = tpk_bps
temp['tpk_prediksi'] = tpk_prediksi
temp = temp.sort_values(by='tpk_online')
plt.figure(figsize=(12,3))
plt.plot(temp['tpk_online'], temp['tpk_bps'], 'bo')
plt.plot(temp['tpk_online'], temp['tpk_prediksi'])
[<matplotlib.lines.Line2D at 0x179bba08a00>]
# Evaluate the model with cross-validation
model_svr = make_pipeline(StandardScaler(), SVR(kernel='linear', C=1, gamma='auto', tol=1.5, epsilon=3, shrinking=False,
cache_size=200, max_iter=100))
score_svr = cross_val_score(model_svr, dataX, tpk_bps, cv=12, scoring='neg_mean_gamma_deviance', verbose=1)
np.mean(abs(score_svr))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    0.0s finished
0.32353938262930093
# Predict Y based on the test X data set
tpk_prediksi = model.predict(dataX_test)
tpk_prediksi
# true: 11.15 8.99 10.24 10.09 10.35 16.68
array([10.00895387, 8.13057591, 10.49184745, 11.27913083, 11.57598922,
15.95167003])
# Calculate the Root Mean Squared Error (RMSE, the scoring metric used on Kaggle)
mse = mean_squared_error(tpk_true, tpk_prediksi)
rmse_svr = math.sqrt(mse)
rmse_svr
0.961905177379139
# Calculate Mean Absolute Percentage Error (MAPE)
mape_svr = mean_absolute_percentage_error(tpk_true, tpk_prediksi)
mape_svr
0.08374976356222269
# Calculate Mean Absolute Error (MAE)
mae_svr = mean_absolute_error(tpk_true,tpk_prediksi)
mae_svr
0.8992946156370799
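The three evaluation metrics above can be bundled into one helper; a minimal sketch (the name `regression_report` is mine, not part of the notebook):

```python
import math
from sklearn.metrics import (mean_squared_error,
                             mean_absolute_percentage_error,
                             mean_absolute_error)

def regression_report(y_true, y_pred):
    """Return RMSE, MAPE and MAE in one dict."""
    return {
        'rmse': math.sqrt(mean_squared_error(y_true, y_pred)),
        'mape': mean_absolute_percentage_error(y_true, y_pred),
        'mae': mean_absolute_error(y_true, y_pred),
    }

# Toy check: predictions off by exactly 1 on each point
report = regression_report([10.0, 20.0], [11.0, 21.0])
print(report['rmse'], report['mae'])  # 1.0 1.0
```

Called as `regression_report(tpk_true, tpk_prediksi)`, it reproduces the three separate calculations above in one place.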
# Display Predicted Y and True Y in a graph
plt.figure(figsize=(12, 3))
plt.plot(datatpk_test, label='TPK Online Test')
plt.plot(tpk_true, label='TPK BPS True')
plt.plot(tpk_prediksi, label='TPK BPS Prediksi')
plt.legend()
plt.show()
# Make data frame based on prediction
hasil = pd.DataFrame()
hasil['Id'] = np.arange(1,7)
hasil['TPK'] = tpk_prediksi
# Save the result into csv file
hasil.to_csv('Results/SVR/hasilv999991-svr.csv', index=None)
# Compare the RMSE with the original one obtained by the instructor during the workshop
math.sqrt(68.88720689917587)
8.299831739208685